Highly Scalable Discriminative Spam Filtering

نویسندگان

  • Michael Brückner
  • Peter Haider
  • Tobias Scheffer
چکیده

This paper discusses several lessons learned from the SpamTREC 2006 challenge. We discuss issues related to decoding, preprocessing, and tokenization of email messages. Using the Winnow algorithm with orthogonal sparse bigram features, we construct an efficient, highly scalable incremental classifier, trained to maximize a discriminative optimization criterion. The algorithm easily scales to millions of training messages and millions of features. We address the composition of training corpora and discuss experiments that guide the construction of our SpamTREC entry. We describe our submission for the filtering tasks with periodical re-training and active learning strategies, and report on the evaluation on the publicly available corpora.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Personalized Spam Filtering for Gray Mail

Gray mail, messages that could reasonably be considered either spam or good by different email users, is a commonly observed issue in production spam filtering systems. In this paper we study this class of mail using a large real-world email corpus and signaturebased campaign detection techniques. Our analysis shows that even an optimal filter will inevitably perform unsatisfactorily on gray ma...

متن کامل

A Scalable Spam Filtering Architecture

The proposed spam filtering architecture for MTA servers is a component based architecture that allows distributed processing and centralized knowledge. This architecture allows heterogeneous systems to coexist and benefit from a centralized knowledge source and filtering rules. MTA servers in the infrastructure contribute to a common knowledge, allowing for a more rational resource usage. The ...

متن کامل

Anti-Spam Grid: A Dynamically Organized Spam Filtering Infrastructure

The spam problem is getting worse all the time. In the paper, we propose Anti-Spam Grid, which can collaboratively filter spam messages by forming a virtual organization. We discuss the design of fuzzy CopyRank and distributed Bayesian algorithm, and describe the architecture of Anti-Spam Grid. A detailed analysis shows that the system is reliable, efficient and scalable, and an experiment show...

متن کامل

A scalable intelligent non-content-based spam-filtering framework

Designing a spam-filtering system that can run efficiently on heavily burdened servers is particularly important to the widely used email service providers (ESPs) (e.g., Hotmail, Yahoo, and Gmail) who have to deal with millions of emails everyday. Two primary challenges these companies face in spam filtering are efficiency and scalability. This study is undertaken to develop an efficient and sc...

متن کامل

Using Biased Discriminant Analysis for Email Filtering

This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method, which relies on dimensionality reduction to retain the most informative and discriminative features from messages. The approach, named Biased Discriminant Analysis (BDA), aims at finding a feature space transformation that closely clusters positive examples whi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006